Data Visualization

Gerko Vink

Methodology & Statistics @ Utrecht University

12 Jun 2025

Disclaimer

I owe a debt of gratitude to many people, as the thoughts and code in these slides are the product of years-long development cycles and discussions with my team, friends, colleagues and peers. When someone has contributed to the content of the slides, I have credited their authorship.

Images are either directly linked, or generated with StableDiffusion or DALL-E. That said, there is no information in this presentation that exceeds legal use of copyright materials in academic settings, or that should not be part of the public domain.

Warning

You may use any and all content in this presentation - including my name - and submit it as input to generative AI tools, with the following exception:

  • You must ensure that the content is not used for further training of the model

Slide materials and source code

Materials

Recap

Yesterday we covered the following topics:

  • Identifying missing values
  • Creating synthetic imputations

Today

Today we cover the following topics:

  • Basic plots: histograms, scatter plots and box plots
  • Advanced plots with ggplot2
  • Customising plots for publication
  • Exporting plots and results

We use the following packages

library(mice)     # Boys dataset
library(dplyr)    # Data manipulation
library(magrittr) # Pipes
library(ggplot2)  # Plotting suite

Why visualise?

  • We can process a lot of information quickly with our eyes
  • Plots give us information about
    • Distribution / shape
    • Irregularities
    • Assumptions
    • Intuitions
  • Summary statistics, correlations, parameters, model tests, p-values do not tell the whole story

ALWAYS plot your data!

Why visualise?

Source: Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. American Statistician. 27 (1): 17–21.
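
Anscombe's quartet ships with base R as the data frame anscombe, so the claim can be checked directly. The sketch below is an added illustration, not part of the original slides; the name quartet_stats is made up for this example.

```r
# Anscombe's quartet is built into R as `anscombe` (columns x1..x4, y1..y4).
# `quartet_stats` is a hypothetical name used only for this illustration.
quartet_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y), cor_xy = cor(x, y))
})
colnames(quartet_stats) <- paste0("set", 1:4)

# The summaries are nearly identical across the four sets ...
print(round(quartet_stats, 2))

# ... but the scatter plots look clearly different.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       main = paste("Set", i), xlab = "x", ylab = "y", bty = "L")
}
par(op)
```

The means and correlations agree to about two decimals, while the four panels show a linear cloud, a curve, a line with one outlier, and a vertical stack with one leverage point.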

Why visualise?

Source: https://www.autodeskresearch.com/publications/samestats

Base R Plots

Histogram

hist(boys$hgt, main = "Histogram", xlab = "Height")

Density

dens <- density(boys$hgt, na.rm = TRUE)
plot(dens, main = "Density plot", xlab = "Height", bty = "L")
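
The na.rm = TRUE in the density call matters because the two functions treat missing values differently: hist() silently drops them, while density() stops with an error unless told to remove them. A minimal sketch with a small made-up vector (the values are illustrative, not taken from boys):

```r
# A small made-up vector with two missing values (illustrative data only)
x <- c(4.2, 5.1, NA, 6.3, 5.8, NA, 4.9)

# hist() drops non-finite values without complaint
h <- hist(x, plot = FALSE)
n_in_hist <- sum(h$counts)

# density() refuses to run on missing values by default;
# capture the error message instead of stopping the script
err <- tryCatch(density(x), error = function(e) conditionMessage(e))

# With na.rm = TRUE the estimate uses the observed values only
d <- density(x, na.rm = TRUE)

n_in_hist
err
```

Because hist() gives no warning, it is worth checking how many values are missing before reading too much into the shape of a histogram.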

Scatter plot

plot(x = boys$hgt, y = boys$wgt, main = "Scatter plot", 
     xlab = "Height", ylab = "Weight", bty = "L")

Box plot

boxplot(hgt ~ reg, data = boys, main = "Boxplot", 
        xlab = "Region", ylab = "Height")

Many R objects also have a plot() method

boys %$% lm(age~wgt) %>% plot()
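
What happens here is S3 dispatch: plot() is a generic function, and for an object of class "lm" it calls the method plot.lm(). The sketch below shows this with the built-in cars data, which is an assumption made so the example needs no extra packages.

```r
# Fit a simple model on the built-in `cars` data (chosen for this sketch only)
fit <- lm(dist ~ speed, data = cars)
class(fit)

# plot() is generic; for class "lm" it dispatches to the S3 method plot.lm
plot_lm <- getS3method("plot", "lm")

# plot.lm can draw several diagnostic plots; `which` selects a subset,
# here residuals vs fitted values and the normal Q-Q plot
op <- par(mfrow = c(1, 2))
plot(fit, which = 1:2)
par(op)
```

Calling methods(plot) lists the classes for which a plot method is available in the current session.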

Neat! But what if we want more control?

ggplot2

What is ggplot2?

Layered plotting based on the book The Grammar of Graphics by Leland Wilkinson.

With ggplot2 you

  1. provide the data
  2. define how to map variables to aesthetics
  3. state which geometric object to display
  4. (optional) edit the overall theme of the plot

ggplot2 then takes care of the details

An example: scatterplot

1: Provide the data

boys %>%
  ggplot()

2: map variables to aesthetics

boys %>%
  ggplot(aes(x = age, y = bmi))

3: state which geometric object to display

boys %>%
  ggplot(aes(x = age, y = bmi)) +
  geom_point()

Why this syntax?

Create the plot

gg <- 
  boys %>%
  ggplot(aes(x = age, y = bmi)) +
  geom_point(col = "dark green")

Add another layer (smooth fit line)

gg <- gg + 
  geom_smooth(col = "dark blue")

Give it some labels and a nice look

gg <- gg + 
  labs(x = "Age", y = "BMI", title = "BMI trend for boys") +
  theme_minimal()

Why this syntax?

plot(gg)
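
One way to read the + syntax: a ggplot is an ordinary R object that collects layers, and nothing is drawn until the object is printed or plotted. The sketch below counts the layers as they are added. It uses the built-in mtcars data instead of boys, an assumption made so that only ggplot2 is needed.

```r
library(ggplot2)

# Built-in `mtcars` is used here (an assumption) so the sketch needs only ggplot2
p <- ggplot(mtcars, aes(x = wt, y = mpg))   # data and mapping, no layers yet
n_start <- length(p$layers)

p <- p + geom_point(colour = "dark green")  # layer 1: points
p <- p + geom_smooth(method = "lm", formula = y ~ x,
                     colour = "dark blue")  # layer 2: fitted line
n_end <- length(p$layers)

c(before = n_start, after = n_end)

# Drawing happens only here
print(p)
```

Storing the plot in an object makes it easy to reuse a base plot and try different layers, labels or themes on top of it.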

Aesthetics

  • x
  • y
  • size
  • colour
  • fill
  • opacity (alpha)
  • linetype

Aesthetics

gg <- 
  boys %>% 
  filter(!is.na(reg)) %>% 
  
  ggplot(aes(x      = age, 
             y      = bmi, 
             size   = hc, 
             colour = reg)) +
  
  geom_point(alpha = 0.5) +
  
  labs(title  = "BMI trend for boys",
       x      = "Age", 
       y      = "BMI", 
       size   = "Head circumference",
       colour = "Region")

Aesthetics

plot(gg)
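
The slides use aesthetics in two ways: mapped to a variable inside aes() (colour = reg) and set to a fixed value inside the geom (alpha = 0.5, col = "dark green"). A minimal sketch of the difference, again on the built-in mtcars data as an assumption to keep the example small:

```r
library(ggplot2)

# Mapped: colour varies with a variable, so it goes inside aes()
# and ggplot2 adds a legend for it.
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

# Set: colour is a single fixed value, so it goes outside aes(),
# as an argument to the geom. No legend is drawn.
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "dark green")

names(p_mapped$mapping)
names(p_set$mapping)
```

A common slip is writing aes(colour = "dark green"): that maps colour to a constant string, so the points get the first default palette colour and a legend labelled "dark green".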

Geoms

  • geom_point

  • geom_bar

  • geom_line

  • geom_smooth

  • geom_histogram

  • geom_boxplot

  • geom_density

Geoms: Bar


data.frame(x = letters[1:5], 
           y = c(1, 3, 3, 2, 1)) %>% 
  ggplot(aes(x = x, y = y)) + 
  geom_bar(fill = "dark green", 
           stat = "identity") +
  labs(title = "Value per letter",
       x     = "Letter", 
       y     = "Value")
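
By default geom_bar() counts rows per category, which is why stat = "identity" is needed when the heights are already in the data. geom_col() is a shorthand for that case. The sketch below compares the two on the same data frame; the names letter_values, p_bar and p_col are made up for this example.

```r
library(ggplot2)

# Same data as in the bar example; object names are illustrative
letter_values <- data.frame(x = letters[1:5], y = c(1, 3, 3, 2, 1))

p_bar <- ggplot(letter_values, aes(x = x, y = y)) +
  geom_bar(stat = "identity")   # use y as bar height instead of counting rows

p_col <- ggplot(letter_values, aes(x = x, y = y)) +
  geom_col()                    # shorthand for the line above

# Extract the computed bar heights from both plots
bar_heights <- layer_data(p_bar)$y
col_heights <- layer_data(p_col)$y
all.equal(bar_heights, col_heights)
```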

Geoms: Line


set.seed(123) # arbitrary seed, makes the random draws reproducible
ggdat <- data.frame(x = 1:100, 
                    y = rnorm(100))
ggdat %>% 
  ggplot(aes(x = x, y = y)) + 
  geom_line(colour = "dark green", 
            lwd = 1) +
  ylim(-2, 3.5) +
  labs(title = "Some line thing",
       x     = "Some x label", 
       y     = "Some value")

Geoms: Smooth


ggdat %>% 
  ggplot(aes(x = x, y = y)) + 
  geom_smooth(colour = "dark green", 
              lwd = 1, 
              se = TRUE) +
  ylim(-2, 3.5) +
  labs(title = "Some line thing",
       x     = "Some x label", 
       y     = "Some value")
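
A detail worth knowing about ylim(): it sets the limits of the scale, and values outside those limits are turned into missing values before any statistic such as the smooth is computed. coord_cartesian() zooms in without dropping data. The sketch below illustrates the difference with points; the data and the planted outlier are made up for this example.

```r
library(ggplot2)

set.seed(123)
dat <- data.frame(x = 1:100, y = rnorm(100))
dat$y[50] <- 10  # plant one value well outside the limits (illustrative)

p <- ggplot(dat, aes(x = x, y = y)) + geom_point()

p_lim  <- p + ylim(-2, 3.5)                        # censors out-of-range values
p_zoom <- p + coord_cartesian(ylim = c(-2, 3.5))   # only changes the view

# Count how many y values survive in the built plot data
kept_lim  <- sum(!is.na(suppressWarnings(layer_data(p_lim))$y))
kept_zoom <- sum(!is.na(layer_data(p_zoom)$y))
c(ylim = kept_lim, coord_cartesian = kept_zoom)
```

With geom_smooth() this means the fitted line under ylim() is based on fewer observations than the one under coord_cartesian(), so the two curves can differ.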

Geoms: Boxplot


boys %>% 
  filter(!is.na(reg)) %>% 
  ggplot(aes(x = reg, 
             y = bmi, 
             fill = reg)) +
  geom_boxplot() +
  labs(title = "BMI across regions",
       x     = "Region", 
       y     = "BMI")

Geoms: Boxplot without legend


boys %>% 
  filter(!is.na(reg)) %>% 
  ggplot(aes(x = reg, 
             y = bmi, 
             fill = reg)) +
  geom_boxplot() +
  labs(title = "BMI across regions",
       x     = "Region", 
       y     = "BMI") +
  theme(legend.position = "none")

Geoms: Density


boys %>% 
  filter(!is.na(reg)) %>% 
  ggplot(aes(x = hgt, fill = reg)) +
  geom_density(alpha = 0.5, 
               colour = "transparent") +
  xlim(0, 250) + 
  labs(title = "Height across regions",
       x     = "Height", 
       fill  = "Region")

Changing the Style: Themes

  • Themes determine the overall appearance of your plot
  • standard themes: e.g., theme_minimal(), theme_classic(), theme_bw(), …
  • extra libraries with additional themes: e.g., ggthemes
  • customize own theme using options of theme()
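
As a sketch of the last point, a custom theme can be built by starting from a standard theme and overriding individual elements with theme(). The name theme_slides and the specific settings below are illustrative choices, not a recommendation from the slides.

```r
library(ggplot2)

# `theme_slides` is a hypothetical name; the settings are illustrative
theme_slides <- theme_minimal(base_size = 14) +
  theme(legend.position  = "bottom",
        plot.title       = element_text(face = "bold"),
        panel.grid.minor = element_blank())

# Apply it like any built-in theme (built-in `mtcars` used for brevity)
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(title = "Custom theme", colour = "Cylinders") +
  theme_slides
print(p)
```

Keeping the theme in one object makes it straightforward to give all plots in a report the same look.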

Changing the Style: Themes


boys %>% 
  filter(!is.na(reg)) %>% 
  ggplot(aes(x = hgt, fill = reg)) +
  geom_density(alpha = 0.5, 
               colour = "transparent") +
  xlim(0, 250) + 
  labs(title = "Height across regions",
       x     = "Height", 
       fill  = "Region") +
  theme_minimal()

Changing the Style: Themes


boys %>% 
  filter(!is.na(reg)) %>% 
  ggplot(aes(x = hgt, fill = reg)) +
  geom_density(alpha = 0.5, 
               colour = "transparent") +
  xlim(0, 250) + 
  labs(title = "Height across regions",
       x     = "Height", 
       fill  = "Region") +
  theme_gray()

Changing the Style: Themes


boys %>% 
  filter(!is.na(reg)) %>% 
  ggplot(aes(x = hgt, fill = reg)) +
  geom_density(alpha = 0.5, 
               colour = "transparent") +
  xlim(0, 250) + 
  labs(title = "Height across regions",
       x     = "Height", 
       fill  = "Region") +
  theme_classic()

Changing the Style: Themes


boys %>% 
  filter(!is.na(reg)) %>% 
  ggplot(aes(x = hgt, fill = reg)) +
  geom_density(alpha = 0.5, 
               colour = "transparent") +
  xlim(0, 250) + 
  labs(title = "Height across regions",
       x     = "Height", 
       fill  = "Region") +
  theme_bw()
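
Exporting plots is on today's list of topics. Once a plot has its final theme, one option is ggsave(), which writes a ggplot object to a file and picks the format from the file extension. The sketch below is a minimal example under assumptions: it uses the built-in mtcars data, writes to a temporary file, and the size is an arbitrary choice.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_minimal()

# A temporary file is used here; in practice this would be a path of your
# own choosing, for example "figure1.pdf" (a hypothetical file name).
out_file <- tempfile(fileext = ".pdf")

# Width and height are illustrative; set them to the size the journal asks for
ggsave(filename = out_file, plot = p, width = 16, height = 10, units = "cm")

file.exists(out_file)
```

Setting the size explicitly, instead of relying on the size of the plot window, keeps text and point sizes consistent between exports.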

Interactive plots

Use plotly::ggplotly() to make any ggplot interactive

library(plotly)
gg <- boys %>% 
  filter(!is.na(reg)) %>% 
  
  ggplot(aes(x      = age, 
             y      = bmi, 
             colour = reg)) +
  
  geom_point(alpha = 0.5) +
  
  labs(title  = "BMI trend for boys",
       x      = "Age", 
       y      = "BMI", 
       colour = "Region") + 
  theme_minimal()

ggplotly(gg)

Practical